IMPORTING OF LIBRARIES AND CSV FILE

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
cm = pd.read_csv('CarPrice_data.csv')
In [2]:
cm
Out[2]:
car_ID symboling CarName fueltype aspiration doornumber carbody drivewheel enginelocation wheelbase ... enginesize fuelsystem boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
0 1 3 alfa-romero giulia gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 13495.0
1 2 3 alfa-romero stelvio gas std two convertible rwd front 88.6 ... 130 mpfi 3.47 2.68 9.0 111 5000 21 27 16500.0
2 3 1 alfa-romero Quadrifoglio gas std two hatchback rwd front 94.5 ... 152 mpfi 2.68 3.47 9.0 154 5000 19 26 16500.0
3 4 2 audi 100 ls gas std four sedan fwd front 99.8 ... 109 mpfi 3.19 3.40 10.0 102 5500 24 30 13950.0
4 5 2 audi 100ls gas std four sedan 4wd front 99.4 ... 136 mpfi 3.19 3.40 8.0 115 5500 18 22 17450.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
200 201 -1 volvo 145e (sw) gas std four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 23 28 16845.0
201 202 -1 volvo 144ea gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 8.7 160 5300 19 25 19045.0
202 203 -1 volvo 244dl gas std four sedan rwd front 109.1 ... 173 mpfi 3.58 2.87 8.8 134 5500 18 23 21485.0
203 204 -1 volvo 246 diesel turbo four sedan rwd front 109.1 ... 145 idi 3.01 3.40 23.0 106 4800 26 27 22470.0
204 205 -1 volvo 264gl gas turbo four sedan rwd front 109.1 ... 141 mpfi 3.78 3.15 9.5 114 5400 19 25 22625.0

205 rows × 26 columns

In [3]:
cm.describe()
Out[3]:
car_ID symboling wheelbase carlength carwidth carheight curbweight enginesize boreratio stroke compressionratio horsepower peakrpm citympg highwaympg price
count 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000 205.000000
mean 103.000000 0.834146 98.756585 174.049268 65.907805 53.724878 2555.565854 126.907317 3.329756 3.255415 10.142537 104.117073 5125.121951 25.219512 30.751220 13276.710571
std 59.322565 1.245307 6.021776 12.337289 2.145204 2.443522 520.680204 41.642693 0.270844 0.313597 3.972040 39.544167 476.985643 6.542142 6.886443 7988.852332
min 1.000000 -2.000000 86.600000 141.100000 60.300000 47.800000 1488.000000 61.000000 2.540000 2.070000 7.000000 48.000000 4150.000000 13.000000 16.000000 5118.000000
25% 52.000000 0.000000 94.500000 166.300000 64.100000 52.000000 2145.000000 97.000000 3.150000 3.110000 8.600000 70.000000 4800.000000 19.000000 25.000000 7788.000000
50% 103.000000 1.000000 97.000000 173.200000 65.500000 54.100000 2414.000000 120.000000 3.310000 3.290000 9.000000 95.000000 5200.000000 24.000000 30.000000 10295.000000
75% 154.000000 2.000000 102.400000 183.100000 66.900000 55.500000 2935.000000 141.000000 3.580000 3.410000 9.400000 116.000000 5500.000000 30.000000 34.000000 16503.000000
max 205.000000 3.000000 120.900000 208.100000 72.300000 59.800000 4066.000000 326.000000 3.940000 4.170000 23.000000 288.000000 6600.000000 49.000000 54.000000 45400.000000
In [4]:
cm.isnull().sum()
Out[4]:
car_ID              0
symboling           0
CarName             0
fueltype            0
aspiration          0
doornumber          0
carbody             0
drivewheel          0
enginelocation      0
wheelbase           0
carlength           0
carwidth            0
carheight           0
curbweight          0
enginetype          0
cylindernumber      0
enginesize          0
fuelsystem          0
boreratio           0
stroke              0
compressionratio    0
horsepower          0
peakrpm             0
citympg             0
highwaympg          0
price               0
dtype: int64

EXPLORATORY DATA ANALYSIS

UNIVARIATE ANALYSIS

In [5]:
cm['CarName'].value_counts()
Out[5]:
toyota corona           6
toyota corolla          6
peugeot 504             6
subaru dl               4
mitsubishi mirage g4    3
                       ..
mazda glc 4             1
mazda rx2 coupe         1
maxda glc deluxe        1
maxda rx3               1
volvo 246               1
Name: CarName, Length: 147, dtype: int64
In [6]:
cm['fueltype'].value_counts()
Out[6]:
gas       185
diesel     20
Name: fueltype, dtype: int64
In [7]:
sns.countplot(x='doornumber', data=cm)
Out[7]:
<AxesSubplot:xlabel='doornumber', ylabel='count'>
In [8]:
cm['enginelocation'].hist()
Out[8]:
<AxesSubplot:>
In [9]:
cm['drivewheel'].hist()
Out[9]:
<AxesSubplot:>
In [10]:
cm.hist(column='wheelbase', bins=120, edgecolor='black')
Out[10]:
array([[<AxesSubplot:title={'center':'wheelbase'}>]], dtype=object)
In [11]:
cm.hist(column='enginesize',bins=30)
Out[11]:
array([[<AxesSubplot:title={'center':'enginesize'}>]], dtype=object)

BIVARIATE ANALYSIS

In [12]:
sns.barplot(x='doornumber', y='price', data=cm)
Out[12]:
<AxesSubplot:xlabel='doornumber', ylabel='price'>
In [13]:
# palette only takes effect together with hue, so it is dropped here
sns.lmplot(x='price', y='enginesize', data=cm)
Out[13]:
<seaborn.axisgrid.FacetGrid at 0x1afad2c5550>

MULTIVARIATE ANALYSIS

In [14]:
from pandas_profiling import ProfileReport
In [15]:
ProfileReport(cm)
Out[15]:

In [16]:
sns.boxplot(x='price', y='aspiration', hue='fueltype', data=cm)
Out[16]:
<AxesSubplot:xlabel='price', ylabel='aspiration'>

INSIGHTS FROM THE EDA

  • This study covers 205 cars.
  • The cars come in two door configurations: two doors and four doors.
  • The cars run on either gas or diesel.
  • Gas cars far outnumber diesel cars (185 versus 20).
  • The toyota corona, toyota corolla, and peugeot 504 are the most common models, with 6 cars each.
  • Most of the cars have the engine at the front.
  • There are more standard-aspiration cars than turbo-aspiration cars.
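
To put a number on the gas-versus-diesel observation, the counts from Out[6] can be converted to proportions. The following is a minimal sketch that rebuilds the column from those reported counts instead of reading the CSV, so it runs on its own; on the real data, `cm['fueltype'].value_counts(normalize=True)` should give the same shares.

```python
import pandas as pd

# Rebuild the fueltype column from the counts reported in Out[6]
# (185 gas, 20 diesel) so the example runs without the CSV file.
fueltype = pd.Series(['gas'] * 185 + ['diesel'] * 20, name='fueltype')

# normalize=True returns proportions instead of raw counts
shares = fueltype.value_counts(normalize=True)
print(shares.round(3))
```

Roughly nine out of ten cars in the study run on gas, which is worth keeping in mind when comparing prices across fuel types.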

MODELING

TRAINING THE MODEL

In [18]:
x = cm[['symboling', 'wheelbase', 'enginesize', 'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']]
y = cm[['price']] 
In [19]:
from sklearn.model_selection import train_test_split
In [20]:
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
In [21]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=80)
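
The size of the resulting split can be worked out by hand. This is a small sketch of the arithmetic, assuming the 205 rows shown in Out[2]; the variable names are illustrative only.

```python
import math

n_samples = 205   # rows in the dataset (Out[2])
test_size = 0.3   # fraction passed to train_test_split

# With a float test_size, scikit-learn rounds the test count up
# and gives the remaining rows to the training set.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

This agrees with the 62 rows reported for `y_test` and `x_test` further down.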
In [22]:
from sklearn.linear_model import LinearRegression 
In [23]:
lm = LinearRegression()
In [24]:
lm.fit(x_train,y_train)
Out[24]:
LinearRegression()
In [25]:
lm.intercept_
Out[25]:
array([-21678.14493683])
In [26]:
lm.coef_
Out[26]:
array([[ 1.16732155e+02,  1.97355388e+02,  1.12745187e+02,
        -6.51713714e+02, -3.11424065e+03,  3.73294977e+02,
         3.20095107e+01,  2.04564485e+00, -3.17387546e+02,
         1.13043354e+02]])
In [27]:
x_train.columns
Out[27]:
Index(['symboling', 'wheelbase', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg'],
      dtype='object')
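
The coefficient array is easier to read when each value is paired with its feature name. The sketch below copies the values from Out[26] and the column order from Out[27] so that it runs on its own; with the fitted model in memory, `pd.Series(lm.coef_.ravel(), index=x_train.columns)` should produce the same table.

```python
import pandas as pd

# Feature names in the order shown in Out[27]
features = ['symboling', 'wheelbase', 'enginesize', 'boreratio', 'stroke',
            'compressionratio', 'horsepower', 'peakrpm', 'citympg', 'highwaympg']

# Coefficients copied from Out[26], in the same order
coefs = [1.16732155e+02, 1.97355388e+02, 1.12745187e+02,
         -6.51713714e+02, -3.11424065e+03, 3.73294977e+02,
         3.20095107e+01, 2.04564485e+00, -3.17387546e+02,
         1.13043354e+02]

coef_table = pd.Series(coefs, index=features, name='coefficient')
print(coef_table.sort_values())
```

Each coefficient is the change in predicted price per one-unit change in that feature, with the others held fixed. Because the features are not scaled, the magnitudes are not directly comparable across features.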
In [28]:
prediction = lm.predict(x_test)
prediction
Out[28]:
array([[17158.29005137],
       [ 6476.56656474],
       [11471.69313193],
       [ 8097.50837579],
       [10009.16915121],
       [16520.19436032],
       [ 9892.43699638],
       [ 9116.86004667],
       [ 4915.1043015 ],
       [11386.91882686],
       [14860.31971982],
       [ -685.56432405],
       [15658.1681621 ],
       [21918.42573349],
       [45007.97195302],
       [ 8574.90512583],
       [ 6476.56656474],
       [ 6476.56656474],
       [16202.80681454],
       [26411.28897232],
       [ 9704.9206041 ],
       [30120.35454531],
       [ 9704.9206041 ],
       [39445.20029858],
       [ 9285.33442116],
       [16645.33973797],
       [10374.54175519],
       [ 5508.7945314 ],
       [ 5414.40479067],
       [ 6100.34486001],
       [ 6368.71020069],
       [ 5129.94225586],
       [16725.56833511],
       [19367.15737013],
       [13954.68221463],
       [26662.72829261],
       [13954.68221463],
       [17099.03947467],
       [10535.34592566],
       [10246.8928172 ],
       [10102.11033541],
       [26411.28897232],
       [37354.48875405],
       [ 7679.3958844 ],
       [13734.96846796],
       [ 5508.7945314 ],
       [ 6476.56656474],
       [ 6424.77697572],
       [11818.49719728],
       [14108.0684263 ],
       [14108.0684263 ],
       [10823.14702256],
       [ 7378.52805754],
       [17668.60955565],
       [24687.92862319],
       [ 7118.71664799],
       [17119.70252435],
       [ 9473.94798088],
       [-1864.0133768 ],
       [ 6424.77697572],
       [13954.68221463],
       [30225.91911478]])
In [29]:
y_test
Out[29]:
price
111 15580.0
121 6692.0
143 9960.0
138 5118.0
61 10595.0
... ...
183 7975.0
18 5151.0
96 7499.0
167 8449.0
71 34184.0

62 rows × 1 columns

In [30]:
from sklearn import metrics 
In [31]:
metrics.mean_absolute_error(y_test,prediction)
Out[31]:
2789.1433971475385
In [32]:
metrics.mean_squared_error(y_test,prediction)
Out[32]:
16153787.360080523
In [33]:
np.sqrt(metrics.mean_squared_error(y_test,prediction))
Out[33]:
4019.1774481951556
In [34]:
r_squared = lm.score(x_test,y_test)  # score against the true prices, not the predictions
r_squared
Out[34]:
0.8499688409806188
In [35]:
r2_score(y_test,prediction)
Out[35]:
0.8499688409806188
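
The metrics above can also be computed from their definitions. This is a minimal sketch using only the first five test prices from Out[29] and the first five predictions from Out[28], so its numbers differ from the full-test-set values; the helper `r2` is a hypothetical name introduced for illustration. It also shows why R² must be computed against the true prices: comparing predictions with themselves leaves no residual and always returns 1.0.

```python
import numpy as np

# First five test prices (Out[29]) and their predictions (Out[28])
y_true = np.array([15580.0, 6692.0, 9960.0, 5118.0, 10595.0])
y_hat = np.array([17158.29005137, 6476.56656474, 11471.69313193,
                  8097.50837579, 10009.16915121])

def r2(y, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

mae = np.mean(np.abs(y_true - y_hat))   # mean absolute error
mse = np.mean((y_true - y_hat) ** 2)    # mean squared error
rmse = np.sqrt(mse)                     # root mean squared error

print(mae, rmse, r2(y_true, y_hat))

# Predictions compared with themselves: zero residual, so R^2 is exactly 1
print(r2(y_hat, y_hat))
```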

TESTING THE LINEAR MODEL

In [36]:
y_pred = lm.predict(x_test)
y_pred
Out[36]:
array([[17158.29005137],
       [ 6476.56656474],
       [11471.69313193],
       [ 8097.50837579],
       [10009.16915121],
       [16520.19436032],
       [ 9892.43699638],
       [ 9116.86004667],
       [ 4915.1043015 ],
       [11386.91882686],
       [14860.31971982],
       [ -685.56432405],
       [15658.1681621 ],
       [21918.42573349],
       [45007.97195302],
       [ 8574.90512583],
       [ 6476.56656474],
       [ 6476.56656474],
       [16202.80681454],
       [26411.28897232],
       [ 9704.9206041 ],
       [30120.35454531],
       [ 9704.9206041 ],
       [39445.20029858],
       [ 9285.33442116],
       [16645.33973797],
       [10374.54175519],
       [ 5508.7945314 ],
       [ 5414.40479067],
       [ 6100.34486001],
       [ 6368.71020069],
       [ 5129.94225586],
       [16725.56833511],
       [19367.15737013],
       [13954.68221463],
       [26662.72829261],
       [13954.68221463],
       [17099.03947467],
       [10535.34592566],
       [10246.8928172 ],
       [10102.11033541],
       [26411.28897232],
       [37354.48875405],
       [ 7679.3958844 ],
       [13734.96846796],
       [ 5508.7945314 ],
       [ 6476.56656474],
       [ 6424.77697572],
       [11818.49719728],
       [14108.0684263 ],
       [14108.0684263 ],
       [10823.14702256],
       [ 7378.52805754],
       [17668.60955565],
       [24687.92862319],
       [ 7118.71664799],
       [17119.70252435],
       [ 9473.94798088],
       [-1864.0133768 ],
       [ 6424.77697572],
       [13954.68221463],
       [30225.91911478]])
In [37]:
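# accuracy_score is a classification metric; for a regressor, score() returns R^2.
# It must be computed against the true prices, not the model's own predictions.
accuracy = lm.score(x_test,y_test)
In [38]:
accuracy
Out[38]:
0.8499688409806188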
In [39]:
x_test
Out[39]:
symboling wheelbase enginesize boreratio stroke compressionratio horsepower peakrpm citympg highwaympg
111 0 107.9 120 3.46 2.19 8.4 95 5000 19 24
121 1 93.7 90 2.97 3.23 9.4 68 5500 31 38
143 0 97.2 108 3.62 2.64 9.0 94 5200 26 32
138 2 93.7 97 3.62 2.36 9.0 69 4900 31 36
61 1 98.8 122 3.39 3.39 8.6 84 4800 26 32
... ... ... ... ... ... ... ... ... ... ...
183 2 97.3 109 3.19 3.40 9.0 85 5250 27 34
18 2 88.4 61 2.91 3.03 9.5 48 5100 47 53
96 1 94.5 97 3.15 3.29 9.4 69 5200 31 37
167 2 98.4 146 3.62 3.50 9.3 116 4800 24 30
71 -1 115.6 234 3.46 3.10 8.3 155 4750 16 18

62 rows × 10 columns